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Abstract 



Dynamical properties of image restoration and hyper-parameter estimation 
are investigated by means of statistical mechanics. We introduce an exactly 



'^ ' solvable model for image restoration and derive differential equations with 



respect to macroscopic quantities. From these equations, we evaluate re- 
laxation processes of the system to the equilibrium state. Our statistical 

^^ ■ mechanical approach also enable us to investigate the hyper-parameter esti- 

C^ \ 

mation by means of maximization of marginal likelihood by using gradient 
decent and EM algorithm from dynamical point of view. 
PACS numbers : 02.50.-r, 05.20.-y, 05.50.-q 
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I. INTRODUCTION 

As a typical massive system, image restoration based on the Markov random field 
(MRF) model has been investigated by statistical mechanical technique of disordered 
spin systems [IIj-Q. Among these results, statistical mechanical analysis succeeded in eval- 
uating measure of successfulness for image restoration and made their hyper-parameter 
dependence clear ||^-§] . However, all of those researches were restricted to studies of static 
properties of image restoration. In the context of Bayesian statistical approach, we usu- 
ally use the Markov chain Monte Carlo (MCMC) method to obtain maximum a posteriori 
(MAP) estimate by simulated anneahng [Q, or to calculate expectations over posterior 
distribution for maximum posterior marginal (MPM) estimation 0. In the recent study 
by Nishimori and Wong 0, they introduced an infinite range mean-field version of the 
MRF model and calculated the overlap between the original image and restored one an- 
alytically. However, they did not investigate the dynamical process of image restoration, 
that is to say, the process of the MCMC method by Glauber dynamics to obtain the MPM 
estimate. Although it is worth while to investigate such dynamical processes in image 
restoration, relatively little progress has been made in the theoretical understanding of 
them. Recently, Inoue and Carlucci ^ investigated dynamical properties of gray-scale 
image restoration using the mean-field Q-Ising spin glass model analytically. They found 
that the MPM estimate gets worse than the degraded image when one fails to set the 
hyper-parameters appropriately. Therefore, it is important to study how we should infer 
the optimal hyper-parameters. As an approach to estimate the optimal hyper-parameters, 
maximum marginal likelihood (MML) method has been used by many authors in practical 
situations [^|18|. If one maximizes the marginal likelihood by gradient descent, Boltzmann 
machine-type learning equations are obtained and these equations contain expectations 
over both posterior and prior distributions. In order to carry out those expectations, we 
usually use the MCMC method. However, it is hard to evaluate the performance of the 



MML estimation due to difficulties to simulate the thermo-dynamically equilibrium state 
within reliable precision. Therefore, we need some analytical and rigorous studies on the 
hyper-parameter estimation. Obviously, the learning process of the hyper-parameter esti- 
mation and the stochastic process of the MCMC method as dynamics. From a view point 
of statistical mechanics of spin systems, the process of the hyper-parameter estimation 
is regarded as a dynamics of spin system in which coupling constant and field strength 
are time dependent variables. Then, time dependence of these variables is determined 
by the algorithm we choose to maximize the marginal likelihood. As far as we know, no 
studies have ever tried to investigate those dynamical properties analytically. In this pa- 
per, we investigate dynamical properties of image restoration including hyper-parameter 
estimation by using statistical mechanical technique. 

This paper is organized as follows. In the next section, according to Nishimori and 
Wong 0, we explain statistical mechanical formulation of image restoration in the context 
of the MPM estimation. 



In Sec. m, we derive differential equations with respect to macroscopic observables 
of the infinite range mean-field MRF model from microscopic Master equation. By solv- 
ing these differential equations, we discuss relaxation process of image restoration. In 



Sec. 1^, marginal likelihood as a function of hyper-parameters is calculated by replica 
method. We also derive Boltzmann machine-type learning equations to maximize the 
marginal likelihood by gradient descent. Flows in hyper-parameter space are obtained 
by analyzing the learning equations. In the same section, we investigate the performance 
of EM (expectation and maximization) algorithm which is widely used to estimate 
hyper-parameters from incomplete data sets. It is well known that EM algorithm shows 
faster convergence at the beginning of the algorithm than some other algorithm does. 
However, there is no study to make this property clear by using some solvable models. In 
this section, we compare the performance of EM algorithm with that of gradient descent 
explicitly. The final section is devoted to summary. 



II. STATISTICAL MECHANICAL FORMULATION FOR IMAGE 

RESTORATION 

In this section, we explain how we formulate image restoration as a problem of dis- 
ordered spin system. According to Nishimori and Wong 0, we consider black and while 
image. Then, an original image is denoted by A^-dimensional vector {^} = {^i, ^2, ■ ■ ', ^n) 
and each pixel ^j takes ±1. These pixels are located on an arbitrary lattice in two di- 
mension. In order to treat image restoration by statistical mechanics of disordered spin 
systems, we should assume that the original image is given by a priori Boltzmann-Gibbs 
distribution 



where J2ij{' ■ ■) is carried out for all nearest neighboring pixels. Thus, we use a snapshot of 
the MCMC simulation for the ferromagnetic Ising model as an original image. Ts{= f3^^) 
appearing in the argument of the exponential (0) corresponds to temperature. We obtain 
pictures of all black or all white when we set Tg -^ 0, while we obtain random noise 
pictures in the limit of T^ —>■ 00. A particular original image {^} is degraded to a 
particular damaged picture {r} = (ri, r2, ■ ■ ■, tat) by noise channel represented by the 
following conditional probability 

where sum J2i{- " ") is carried out for all pixels and we assumed that each pixel is degraded 
independently. jSr represents a noise level of the channel because the above expression 
is rewritten as P{—C,i\C,i) = p = 1 — P{^i\C,i) with p = e~'^^/(e^^ + e~^^) for all pixels 
independently. Therefore, this kind of noise is referred to as Binary Symmetric Channel 
(BSC). 

The BSC is easily extended to the Gaussian Channel (GC) as follows. 



piMim = (Tfi^-p (-^^^V^) = ^-({-»-p (^^^^^^) (3) 



ir^(,({^}) = _^ exp (_?M±^] (4) 

If we replace Fqc{{t}) appearing in Eq. (^ by 

with To/r^ = /?T-, the BSC (Eq. (Q)) is recovered. We should notice that a sum J2t ^({t}) 
for an arbitrary function Q{{t}) is calculated in terms of -Fgc,bsc({t}) as 

E ^({^}) = / ■ ■ ■ / 4r}FGC.BSc({r})fi({r}) (6) 

where we defined d{T} = dTidT2 ■ ■ -drj^i. Then, Bayes theorem gives the posterior distri- 
bution 

where J and h are hyper-parameters and we introduced models of the prior (Eq. ([l|)) and 
the hkelihood (Eq. @j) as 

P(M) = ^"P("^""-"^), P({.}|M)--P(''S.-.-<) 



respectively. A configuration {cr} = (cri, (T2, ■ ■ ■ , (Jn) denotes an estimate of a particular 
original image {^}. Zn and Zj;, in Eq. (|]) are normalization constants given by 



'^ \ ij J T \ i / 



(9) 



It is important for us to bear in mind that Z^ is independent of {a} for both the BSC 
and the GC. Actually, Zl leads to 

^.^/.../<*WfBsc(W)exp(/,i:^ - (1^)" (10) 

for the BSC and 



Zl = J ■■■Jd{r}Foc{{r})exp (hY^naj = exp ^-^ + ^— j (11) 

for the GC. 

In the context of MAP estimation, we choose the estimate {a} as a grand state of the 
following Hamiltonian (cost function) 

ij i 

In order to obtain the grand state, we usually use simulated annealing H] or mean field 



annealing [|T0 



On the other hand, in the context of MPM estimation, we first calculate the marginal 
distribution around a single pixel ai : 

and we choose the sign of the difference between P{ai = +l|{r}) and P{cx = — l|{r}) as 
an estimate of the i-th pixel ^i as 

6 = argmaxP(a.|{r}) = sgn I E^^(^.|{^})] = sgn (^3^MFTr) 

= sgn {{ai)j,h)- (14) 

In this expression, we defined {<Ji)j^h as an average of the i-th pixel ctj over the posterior 
distribution (|^) and this is written explicitly as 

{o-i)j,h = \v- Uy^ • (15) 

This corresponds to a local magnetization of the spin system that is described by the 
Hamiltonian 7i{{a}) at temperature T = 1. Thus, in order to investigate properties of 
the MPM estimation for image restoration, we should study the random field Ising model 
described by T-C{{o-}). Then, we are interested in the quantity : 

M{J,h)^j:P{{^})P{{r}m)^,sgn{{a,)j,,) (16) 



which means the averaged overlap between an arbitrary original pixel ^j and the MPM es- 
timate ^i = sgn((o"j)j^/i). Apparently, the best restoration of the original image is achieved 
when the overlap M is as close to 1 as possible. For this averaged overlap M{J, h), the 
next inequality holds 0. 

M{J,h)<M{(3,,(3r) (17) 

This inequality means that the averaged overlap M takes its maximum when one sets 
the hyper-parameters to their true values, namely, J = j3s and h = (3^-. However, it is 
impossible to derive the hyper-parameter dependence of the overlap around its optimal 
value M{(3s, Pt) from the above inequality. To investigate this dependence, Nishimori and 
Wong introduced a mean-field version of the MRF model and calculated the overlap as 
a function of J and h. The mean-field model is rather an artificial model in which every 
pixel is connected to the others, however, this model is very useful to discuss the behavior 



of macroscopic quantities of the system, like the overlap M. Using replica method |TT 
one obtains saddle point equations 

"^0 = v^ 5Z ^i = tanh(/5^mo) (1^ 



^PsinoS, 

m 



T7 y" CTi = 7^ — 177^5 7 / ^^ tanh( Jm + rhx + Toh^) (19) 

N ^ 2cosh(psmo) -'-oo 

M = -Y, ^S = TT^ 7 / ^^ esgn( Jm + rhx + t^K) (20) 

N ^ 2cosh(p<imo) J-oo 

where we defined Gaussian integral measure by Dx = dxe~^ 1"^ l\phx. Equation (0) de- 
termines macroscopic properties of the original image given by the Hamiltonian — ^^ C,iC,j 
at temperature Ts{= P^^). From statistical mechanical point of view, rriQ corresponds to 
magnetization of the mean- field ferro- magnetic Ising model. For a given Tg, one obtains 
niQ by solving Eq. (|18|). Substituting Tg, mo and noise parameters tq (a center of Gaus- 
sian) and r (a standard deviation) into Eq. (0), one obtains magnetization m for the 
restored image system {cr} as a function of Tm{= J^^) and h. Then, one substitutes 
m{Tm, h) into the expression of M, and finds hyper-parameter dependence of the overlap 



explicitly. In FIG. ^ we plot the overlap M as a function of 1/J(= T^). We set r = tq = 1 
{jSr = tq/t"^ = 1) and temperature of the original image is chosen as Tg = 0.9. The overlap 
for the two cases of the field h, namely, h = PrTgJ = tqTsJ/t^ = 0.9J = /lopt (a) and 
/i = 1 (b) are shown. We should notice that the MAP estimate is obtained in the limit 
of Tm — ^ keeping the ratio h/J constant. Therefore, the overlap for the MAP estimate 
depends on the ratio h/J and takes its maximum when we set h/J = jSrTs = 0.9 (see FIG. 
|l] (a)). From this figure, we see that the overlap takes its maximum at T^ = Tg = 0.9 
and h = (3r = tq/t^ = 1. In the next section, we focus our attention on dynamics of the 
MPM estimation. 

III. DYNAMICS OF IMAGE RESTORATION 

In the previous section, we showed the performance of the MPM estimation. However, 
in those calculations, we assumed that the system already reached the equilibrium state. 
In other words, each state {cr} obeys the Boltzmann-Gibbs distribution ~ e"^'-''^'^^'' . When 
we need to generate the distribution to calculate the MPM estimate sgn((crj)j_ft), we often 
use the MCMC method and simulate the equilibrium states on computer. Therefore, 
it is important to study how the system relaxes to its equilibrium state and grasp the 
behavior of time evolutionary observables analytically. As far as we know, there is no 
research to deal with dynamics of image restoration including hyper-parameter estimation 
analytically. In this section, for the infinite range mean-field MRF model, we derive 
differential equations with respect to macroscopic order parameters of the restored image 
system from microscopic Master equation. 

First of all, we should remember that a transition rate Wfc({o"}) from {a} = 
(cTi, 0-2, ■ ■ -, CTfc, ■ ■ ■ , ctat) to {a}' = ((Ji, 0-2, ■ ■ ■ , -G^,, ■ ■ -, G j^) Icads to 

Wk[{G}) = - [1 - afc tanh[/ifc({a})]] , /ifc({a}) = ^ E ^i + ^^^ (21) 

i 

in the context of the Glauber dynamics of the MCMC method. It is important for us to 



bear in mind that Hamiltonian ?i({cr}) of the system is rewritten in terms of /ifc({o"}) as 

W^}) = -E^^(WK (22) 

k 

where we rescaled the couphng J as J/N to take a proper thermo-dynamic hmit (Hamil- 
tonian should be of order A^). 

Then, probability Pt({cr}) that the system visits a state {a} at time t obeys the Master 
equation 

^^^ = E [Pt{Fd{a}))w,{F,{{a})) - Pt{{a})w,{{a})] (23) 

where we defined single spin flip operator Fk by 

Fk{{<y}) = {cTi, 0-2, • • •, -cTfe, • • •, aN) = {o-}'. (24) 

Distribution Pt{m, a), which is probability that the system has macroscopic order param- 
eters 

m{W}) = ^ E ^*' «({^}) = ]7 E '^i^i (25) 

i i 

at time t, is written in terms of the distribution Pt({^}) of the microscopic state {a} as 
Pt{m, a) = Y.MW})^im - miW}M(^ - «({^})) (26) 

a 

where S{- ■ ■) is a delta function. Taking a derivative of Pt{m,a) with respect to t and 

substituting Eq. (p3|) into this expression and making a Taylor expansion in powers of 

2ak/N and 2Tk(Jk/N (so-called Kramers-Moyal expansion), we obtain 

dPt{m,a) d ^, ,{ ^ e^»™°« /- ^ ,,^ , ^ A 

= ——Pt[m, a) <m- - — ^715 ^ / F>x tanh( Jm + hrx + hro^) } 

at dm [ 2cosh(psmo) J-oo J 

+ ^—Pt(m, a) la- - — ^— — / Dxirx + roOtanh( Jm + hrx + HtqC) \ 

da [ 2cosh[Psmo) J-oo J 

+ 0{N-^). (27) 

Thus, we derived the time dependent distribution of macroscopic quantities from the 
microscopic Master equation Eq. (|23|) . Finally, we construct differential equations with 
respect to the macroscopic quantities m and a. Substituting a form of distribution 
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Pt{m,a) = S{m-m{t))S{a-a{t)) (28) 

into Eq. ( p7[ ) and calculating some integrals, we obtain 

= -m+ - — ^—— / Dx tanh( Jm + hrx + hroO (29) 

at 2cosn(psmo) J-oo 

— - = —a H ^— — / Dx(tx + ro^)tanh( Jm + /irx + /itq^). (30) 

at 2cosh(psmo) J-oo 

These two equations describe a relaxation of the system to the equilibrium state. We 
should notice that order parameter a is a slave variable in the sense that the order param- 
eter m relaxes independently, whereas the relaxation of a depends on m. Therefore, the 
behavior of a is completely determined by m. For this reason, from now on, we disregard 
Eq. dg). 

It is easy to see that in the limit of t ^ oo and dm/dt = 0, the saddle point equation 
(p!9|) is recovered. As the overlap M is written in terms of m [see Eq. (|20|)], time evolution 



of the overlap is obtained by substituting the time dependence of the magnetization m(t) 
into the expression of M. 

Using the same technique as the procedure to derive the differential equation with 
respect to m, the differential equation for the magnetization mi of the prior system 
-P({^}) = exp(JX]jj o"iO"j)/I]o-exp(JX]jj cTjCTj) is obtained as 

drrii 



dt 



-Till + tanh(mi J). (31) 



Although in these equations we regard the hyper-parameters J and h as constant variables, 
one should treat them as time dependent parameters, that is, J{t) and h{t) from a view 
point of hyper-parameter estimation. Of course, details of the time dependence of J(t) 
and h{t) depend on a particular algorithm of hyper-parameter estimation. In the next 
section, we investigate properties of hyper-parameter estimation as dynamical process of 
coupling constant J{t) and field strength h{t). 



IV. HYPER-PARAMETER ESTIMATION 

In the previous two sections, we investigated both static and dynamical properties 
of image restoration. From those results, we obtained hyper-parameter dependence of 
the overlap explicitly. Moreover, for a particular constant hyper-parameter set {J,h), 
we derived the differential equations which describe the relaxation of the system. As 
one of the authors reported in 0] , if one fails to set the hyper-parameters appropriately, 
the restored image gets worse than the degraded image. In practical situations, we do 
not know the optimal value of the hyper-parameters before we carry out the MCMC 
simulations. Therefore, we need to determine the optimal value by using only information 
about the degraded image {r}. Of course, it is possible for us to construct some robust 
algorithms for hyper-parameter tuning and several authors reported such algorithms based 



on selective freezing ||T3[ or quantum fluctuation |jTj]. However, if one seeks the optimal 



restoration, hyper-parameter estimation becomes a very important problem. 



About ten years ago, Iba |12] studied the performance of the MML method with 
the assistance of the MCMC simulations for the same problem as ours. However, as he 
mentioned in his paper, the results are not enough to make its performance clear due to 
the difficulties of simulating the equilibrium state within reliable precision. With this fact 
in mind, in this section, we calculate marginal likelihood as a function of hyper-parameters 
analytically. From the marginal likelihood, we derive Boltzmann machine-type learning 
equations and investigate their behavior quantitatively. 

A. Maximum marginal likelihood method 

In statistics, maximum marginal likelihood (MML) method is used to infer hyper- 
parameters appearing in the posterior distribution [0,^|15|. In the context of image 
restoration, marginal likelihood (logarithm of marginal likelihood) is given by 
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KiJ,h:{^,T})^\ogj:Pi{r}\W})PiW}) 

; gJE., -.-.+^E, -»-A _ log Zu - log Zl (32) 



iog(E< 



where Zn and Z^ are given by Eq. (|^). We should remember that Z^ is independent of 
{a} for both cases of the BSC and the GC Usually, we attempt to maximize the marginal 
likelihood by using gradient descent with respect to J and h. This result leads to the 
following Boltzmann machine-type learning equations 

^ dl ^ dK{J,h:{^,T}) ^ E.(E.,cT,a,)e^^-.'-'-^+"^--'-' _ E.(E., o-.a,)e"^'^^'^- 

^^dh ^ dK{J,h:{^,T}) ^ E.(E.r,a,)e^^-^'-"-^+"^--'-' _ dlogZ^ 
^ dt dh Y^ QJj2ij''i''3+f'J2i'^i''i dh 

where cj and Ch are relaxation times. Thus, by solving these two equations, we maximize 
the marginal likelihood —K{J, h : {^, r}) and obtain the values of hyper-parameters 
as a fixed point of the equations. Then, we should notice that these two equations 
contain expectations of the quantities Ey <^i<^j and Ej ^iCi over the posterior and the 
prior distributions. Therefore, when we solve Eqs. (^3|), (^) numerically, we should 
calculate these expectations at each time step of the Euler method. Iba ||12| carried out the 
MCMC method to calculate the expectations and evaluated time dependence of the hyper- 
parameters J and h numerically. However, the accuracy of his computer simulation is not 
reliable because the time to simulate the equilibrium state is not enough. Accordingly, it 
is worth while to investigate the performance of the MML method analytically using the 
solvable model. In this subsection, we use the infinite range mean-field MRF model and 
solve the learning equations (p^),(plD exactly. 

As our interest is averaged performance of the MML method, we should calculate the 
averaged marginal likelihood : 



[K{J,h:{^,T})]^^^^ 



} 



log ^e^^'^'"'"^"^^^' ''"''' - log E e^ ^'^ '''" 

[log^L]{5,.} (35) 
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where the bracket [■ ■ ■]{g,r} rneans the average over the distribution P{{t}\{C,})P{{^}) and 
the sum J2ij{' ' ') should be carried out for all pairs of pixels. We should keep in mind 
that we rescaled the coupling constant as J/N to make the averaged marginal likelihood 
(difference of free energy) of order A^. In general, it is hard to carry out this kind of 
average, namely, [logZ]{^,r}. Then, we replace the average with an average of the n-th 
moment of Z, that is, Z" by using 

[logZ]^,,r} = limi^3ii-I^. (36) 



This is refereed to as replica method ||rT|. By using the replica method, we obtain the 
averaged marginal likelihood per pixel as 

2cosh(/?smo) 
-m^ - log2cosh(miJ) + — - 



[ir(J,/i:{e,r})]|g,,| ^ _ J^2 _^ ^g^^°"'°\ r Dx log 2cosh( Jm + hrx + hroO 
N 2 2cosh(n,mn) J-00 



+ -ml - log 2cosh(mi J) + ^^ — = -K{J, h) (37) 



where m and mi are magnetizations of the spin systems described by the posterior and 
the prior, respectively. It should be noticed that as we used the GC (Eqs. (|^), (||)), the 
average [logZiJi^,^! simply led to {Nh'^ /2) - {Ntq/2t'^) (see Eq. (|lTD). 

In FIG. 0, we plot the averaged marginal likelihood as a function of J and h. In 
this figure, we see that the averaged marginal likelihood takes its maximum when we 
choose the hyper-parameters (J, h) so as to be identical to their true values {(3s = l/Tg = 
1.1, Pr = tq/t^ = 1) (we set tq = t = l,Ts = 0.9). This fact is easily checked by the 
following inequality |]l6i 



{-[K{Ps,Pr : {e,r})]{e,.}} - {-[K{J,h: {^,r})]{^,r}} 

= E^/3.(MI{0)^/^.({0)iogE^/3.({r}|W)p,.,(W) 

= E^A,/^.({n)log(P;3.,/^.({r})/Pj,.({r})) > (3^ 

T 

where we used non-negativity of Kullback-Libeler information and we defined 
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P,({,}|{,}) . JEE(£|iZi^, PMrm})- '=='p<-^'s-^'«' 



ErewiXEinai 



Erexp(XEi^i6 



(39) 



Pyjia})^ exp(yE.,cr.a,) p^^^^^^ _ exp(FE.,e.O) 



Eaexp(rEi,o-iO", 



E5exp(rEij60 



(40) 



^x,y({r}) = E^x({r}|W)Py(W) = 5:Px({r}|{0)i^y({0). 



(41) 



Thus, we confirm that our mean-field model is not against this general inequality. We 
should mentioned that static properties of the hyper-parameter estimation was investi- 
gated by several authors using the generalized Gaussian model ^^, mean-field approxi- 
mation m and cluster variation method |[T8[] . 

For the marginal likelihood (|35|), averaged learning equations with respect to J and h 
are obtained by gradient descent : 



cj 



dJ 
dt 



dK{J,h:{C,T}) 
dJ 



Ch 



{«,t} 



dh 
'dt 



dK{J,h:{i,T}) 
dh 



(42) 



{C,r} 



The right hand sides of the above equations are also evaluated by the replica method. 
After some algebra, we obtain 



dJ m^ Ef e^^'""^ 

C 7 ^ ~t~ Tfl ^ 

dt 2 2cosh(/3smo) J-oo 

H nil tanh(mi Jj 



/oo 
Z}xtanh(Jr7i + hrx + Htq^) 



dh 



Ch- 



S« 



^(^s-mo^ 



Dx{tx + ro^)tanh( J?7i + hrx + hrQ^) — T^h 



dt 2cosh(/3smo) J-c 
where we should remember that m and mi obey the differential equations 
dm 



(43) 
(44) 



dt 
dm, I 
~df 



— m + 



E^ e^^™o« 



2cosh(/?s^o) J-oo 
—mi + tanh(mi J). 



/oo 
Dx tanh( Jm + hrx + hr^^) 
-oo 



(45) 
(46) 



By solving these coupled equations, we obtain time dependences of the hyper-parameters 
J{t),h{t) and relaxation process of the systems, namely, m(t),mi(t). In this paper, we 
fix the relaxation times as cj = c/^ = 1. 
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In FIG. ^, we plot time dependences of the hyper-parameters J, h and order parameters 
■m,mi. From this figure, we see that the final state of the hyper-parameters is optimal, 
namely, (J^, h^) = (l/Ts,/3^ = tq/t^) = (1.1, 1) and this convergent point is independent 
of the initial conditions. Time evolutions of the overlap M are also plotted in FIG. ^ 
(upper figure). We find that the overlap M converges to the best possible value in FIG. [l|. 
In FIG. 1^, we plot flows of hyper-parameter J-h. From this flgure, we flnd that each flow 
does not take the shortest path to the solution and goes a long way around the solution. 

B. EM algorithm 



In the previous subsection, we investigated the process of the MML method by gradient 
decent as a dynamics. In this section, we analyze the performance of EM algorithm f^ as 
another candidate to maximize the marginal likelihood. 

In EM algorithm, we flrst average the log-likelihood function 

J 



logP({r}|{a})P(M) 



jy J2 ^i^i + hJ2 '^^^^ " l^S H ^Xp 



1-3 



+ 



Ntq Nr^h 



2U2 



2r2 2 

over the time dependent posterior distribution 

it y^ 



(47) 



— I]i,°"»°"j+'**I]i'^''^' 



J2 e^^'i"'"''^^'^'^'''" 



(4J 



This average is referred to as Q-function. As we are interested in the averaged behavior 
of the Q-function, we need the following averaged Q-function : 



Q{J,h\Jt,ht 



EP,({a}|{r})logP({r}|M)P(M) 



J {5,r} 



J 



h 



E.(E.r,cr,)e^^-^--"--+"^^--"-' 
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{i,r] 



{^.r} 



- log5:exp U^E^^^. + ^ - ^— (49) 

'^ \ ^j / 

where we divided the couphng constants J and Jt by A^ to take a proper thermo-dynamic 
hmit. 

Then, EM algorithm is summarized as follows. 

• Step 1. 

Set initial values of the hyper-parameters Jo, ^o and t <— 0. 

• Step 2. 

Iterate the following E (expectation) and M (maximization) steps until an appro- 
priate convergence condition is satisfied. 

— E step : Calculate Q{J,h\Jt,ht). 

— M step : Update Jt and ht by 

Jt+i = argmaxQ(J, h\Jt, ht) 
ht+i = argmaxQ(J, h\Jt, ht) 

h 

and t <— t + 1. 

For our infinite range mean-field MRF model, the averages [■ ■ -Jj^.t} in Eq- ( ^9]) are 
calculated by using the replica method and we obtain 



QiJ,h\Jt,ht) _ Jm{t)^ Jm{t)j:^e 



Psmoi 



/oo 
Dxiai\h{Jt'm{t) + htTX + htTo^) 



N 2 2cosh(/?,mo) 

+ 7^ rriS T / Dx{tx + roOtanh(Jtm(t) + htTX + htToO 

2cosh[ps'mQ) J-oo 

J T^ T^h^ 

+ -m^itf - log2cosh(mi(t)J) + 77^ - ^^. (50) 

2 Zt"^ 2 

At the next time step, Jt+i and ht+i are given by the conditions dQ/dJ = and dQ/dh = 
0. These two conditions lead to non- linear maps : 
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Jt+1 = ^T—tanh ^ 
m[t) 



{m{ty -mi(t)2}2 



2mi(t) 
+ o u^ urn V / ^^ tanh( J,m(t) + /i^rx + hr^ (51) 

/it+i = ^ n , fo V / -^^ (™ + i"oOtanh( Jim (t) + /ijTX + htTo^). (52) 

2r^cosn(psmoj J-oo 

In the above non-linear maps, m{t) and mi{t) are time dependent magnetizations for the 
systems described by the posterior -P({o"}|{t}) and the the prior P{{a}), respectively. By 
using mean-field treatment, we obtain non-linear maps with respect to m{t) and wii(t) as 

m(t + 1) = ^— — / Z^x tanh( Jim(t) + htTX + htTo^) (53) 

2cosh(p5mo) J-oo 

mi{t + 1) = tanh(Jtmi(t)). (54) 



By solving these non-linear maps Eqs. (pT])-(p^), we obtain the time dependence of 
the hyper-parameters Jtiht and the magnetizations m{t),mi{t). We plot the results in 
FIG. § (lower figure), FIG. |^ and FIG. p. From these figures, we see that both the 
MML method by gradient descent and EM algorithm obtain the optimal hyper-parameters 
(J*,/i*) = (1.1,1), however, EM algorithm shows faster convergence than the MML by 
gradient descent. In addition, the fiows of EM algorithm in the hyper-parameter space 
are shorter than those of the MML by gradient descent. From the posterior distribution 
appearing in the Q-function (^), we see that performance of EM algorithm highly depends 
on the initial choice of the hyper-parameters Jo and ho. Therefore, for the systems which 
have lots of local minima, the final solution is sensitive to the initial condition on the 
hyper-parameters. However, for our model system (the infinite range random field Ising 
model), there is no local minima in the marginal likelihood function. As the result, the 
final state of EM algorithm is independent of the initial conditions. 

V. SUMMARY 

In this paper, we investigated dynamical properties of image restoration by using 
statistical mechanics. We introduced an infinite range mean-field version of the MRF 
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model and solved it analytically. We derived differential equations with respect to the 
macroscopic order parameters from the microscopic Master equation. We also studied dy- 
namics of hyper-parameter estimation in the context of the maximum marginal likelihood 
method by using gradient descent and EM algorithm. For the MML method by gradient 
descent, Boltzmann machine-type learning equations were evaluated analytically by the 
replica method. On the other hand, EM algorithm led to non-linear maps and these maps 
were also evaluated analytically. We compared these two algorithms and found that for 
both algorithms, we obtain the optimal hyper-parameters. We also found that the speed 
of convergence for EM algorithm is faster than that of the MML method by gradient 
descent. In addition, the paths to the solution in hyper-parameter space by EM algo- 
rithm are shorter than those of the MML by gradient descent. Thus, in this paper, we 
could compare two different methods to estimate hyper-parameters without any computer 
simulations. Our analytical treatments are applicable to studies of performance for the 
other method including deterministic annealing EM algorithm |T^J2^. Moreover, besides 



image restoration, our approach is useful for the other problems, for example, learning 
by Bayesian neural networks [^l[^, time series predictions [^ or density estimation 
problem [^ . 



We thank Hidetoshi Nishimori, Masato Okada, Yukito Iba and David Saad for fruitful 
discussions. Our special thanks are due to Toshiyuki Tanaka for useful discussions and 
comments. 
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FIGURES 
FIG. 1. 1/J(= Tfn) dependence of the overlap M. Temperature of the original image 

is Tg = 0.9 and the noise revel is /?,- = tq/t^ = l(ro = r = 1). We set the field h as 

h = PrTsJ = {toTs/t'^)J = 0.9J = hopt (a) and /i = 1 (b). In the limit of 1/J -^ 0, we 

obtain the overlap of the MAP estimation. In both cases (a) and (b), the overlap M takes its 

maximum at T^ = Tg = 0.9. 

FIG. 2. J-dependence of the averaged marginal likelihood —K (upper figure). We set 
h = 0.5,1 and h = 1.5. We see that —K takes its maximum when we choose J,h as 
J = 1.1(= l/7s) and /i = /3t- = 1. /i-dependence of the averaged marginal likelihood —K 
(lower figure). We set J = 0.5,1 and J = 2.1. We see that —K takes its maximum when we 
choose J,h as J = 1.1 = (l/Ts) and /i = /?,- = 1. For both figures, we chose {m, mi) as a solution 
of Eq. (|l^) and mi = tanh(Jmi) for J = l/Tg and h = /?,-. 

FIG. 3. From the upper left to the lower right, time dependences of the hyper-parameters 
J, h and the magnetizations m, rrii are plotted. In each graph, we choose the initial condition 
(a) J(0) = 0.45, h{0) = 1, m(0) = mi(0) = 0.4, (b) J(0) = 0.45, h{0) = 0.5, m(0) = mi(0) = 0.4, 

(c) J(0) = 2.25, h{0) = 1, m(0) = mi(0) = 0.4, (d) J(0) = 2.25, ^(0) = 0.5, m(0) = mi(0) = 0.4. 
We set true values of the hyper-parameters Tg = 0.9, /?,- = 1 

FIG. 4. Time dependences of the overlap M for the case of the MML by gradi- 
ent descent (upper figure) and the case of EM algorithm (lower figure). For both cases, 
we choose the initial condition as (a) J(0) = 0.45, /i(0) = l,m(0) = ?n,i(0) = 0.4, (b) 
J(0) = 0.45, /i(0) = 0.5, m(0) = mi(0) = 0.4, (c) J(0) = 2.25, h{0) = l,m(0) = mi(0) = 0.4, 

(d) J(0) = 2.25, /i(0) = 0.5,?n(0) = mi(0) = 0.4. We set true values of the hyper-parameters 
Tg = 0.9, (3r = 1- We see that for both cases, the optimal overlap Mopt is obtained as a fixed 
point of the dynamics. 
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FIG. 5. Flows in the hyper-parameter space {J,h). We set the initial conditions 
J(0) = Jo = 0.45, /i(0) = ho = 1, and m(0) = mi(0) = 0.4 (upper figure) and 
J(0) = Jo = 2.25, /i(0) = /lo = 1 and m{0) = mi(0) = 0.4 (lower figure). True values of 
the hyper-parameters are J* = l/Tg = 1.1, h^, = Pr = 1. For the case of gradient descent (GD), 
the flows go a long way around the solution (J*, /i^,) = (1.1, 1). In order to compare the MML 
by gradient descent with the EM algorithm, we also plot flows of EM algorithm (EM). We see 
that EM algorithm takes shorter paths than the MML by gradient descent. 

FIG. 6. From the upper left to the lower right, time dependences of the hy- 

per-parameters J, h and the magnetizations m, mi for the EM algorithm are plotted. In 
each graph, we choose the initial condition (a) Jo = 0.45, /lo = l,m(0) = mi(0) = 0.4, (b) 
Jo = 0.45, /lo = 0.5, m(0) = mi(0) = 0.4, (c) Jq = 2.25, /lo = l,m(0) = mi(0) = 0.4, (d) 
Jo = 2.25, /lo = 0.5, m(0) = mi(0) = 0.4. We set true values of the hyper-parameters Tg = 0.9, 

/3r = l 
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